Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression

Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs He et al. [2022] and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer’s disease, and evidence a significant improvement in power.


Introduction 1.Background and contributions
Modern large-scale studies frequently involve a multitude of explanatory variables potentially associated with an outcome we would like to better understand.Oftentimes, the goal is to select those explanatory variables that are meaningfully associated with the response variable.For instance, with recent advances in genome sequencing technologies and genotype imputation techniques, one can now gather tens of millions of variants from hundreds of thousands of samples in large-scale genetic studies, with the aim of pinpointing which genetic variants are biologically associated with specific diseases.This information could provide mechanistic insights and potentially aid the development of targeted drugs.In statistics, this challenge is typically framed as a multiple testing problem.Further, due to the sheer number of hypotheses considered and the cost of following false leads, it is generally required to control some form of error rate on the false positives.
In this paper, we focus on controlling the false discovery rate (FDR), which is the expected proportion of false selections among all selected variables.Compared to the more stringent familywise error rate (FWER) control, keeping

Code availability and reproducibility
The software and example code that reproduce the results presented in this paper can be found at https://github.com/biona001/ghostknockoff-gwas-reproducibility/tree/main/chen_et_al.Simulation results in Section 3.5, Section 4.4.2 and Section 4.4.3 can be exactly reproduced.Due to data accessibility issue, we only provide code without real data for Section 4.4.1 and Section 5.

Model-X Knockoffs and GhostKnockoffs
To begin with, we define the controlled variable selection problem and give a brief review of model-X knockoffs and GhostKnockoffs.For a more detailed exposition, we refer readers to Candès et al. [2018], Barber and Candès [2015], and He et al. [2022].In the following, we use boldface letters for vectors and matrices.* We use Xj ∈ R n and xi ∈ R p to respectively represent the jth column and ith row of the covariate matrix X.

Problem statement
Given covariates X ∈ R p and a response Y ∈ R, we are interested in understanding which variables influence Y .We formulate this selection problem as testing the conditional independence hypotheses H j 0 : Xj ⊥ ⊥ Y | X−j for 1 ≤ j ≤ p, where X−j is a shorthand for all the variables except the jth; that is X−j = {X1, ..., Xj−1, Xj+1, ..., Xn}.In words, we should reject H j 0 if we believe that Xj can help better predict the outcome than if we only had available the values of all the other variables.Put differently, Xj has information about Y which cannot be subsumed by the information contained in all the other variables.By conditioning on X−j, these hypothesis tests aim to weed out variables whose relationship to Y is driven by residual correlations with other covariates.
Let H0 ⊂ [p] be the set of indices for which the null conditional independence hypothesis H j 0 is true, and let S ⊂ [p] be the set of indices of the hypotheses rejected by a selection procedure.The false discovery rate (FDR) is the expected fraction of false positives among the selected, defined as with the convention that 0/0 = 0. Our goal is to make as many rejections as possible while controlling the FDR below a user-specified level q.
In this paper, we consider the setting in which, instead of observing i.i.d.samples from the distribution of (X, Y ), we only have some summary statistics of the i.i.d.samples.In particular, we will show how one can, quite remarkably, perform tests of conditional independence when we do not directly observe the i.i.d.samples.Throughout this paper, we assume that X ∼ N (0, Σ) where Σ is known (or, in practice, can be estimated).

The procedure
Suppose we observe n i.i.d.samples (Xi, Yi), 1 ≤ i ≤ n, arranged in a data matrix X ∈ R n×p and response vector Y ∈ R n .In the model-X knockoffs framework Candès et al. [2018], we assume we know the distribution PX of the covariates X while having no knowledge of the conditional distribution Y | X.The model-X approach is well-suited to genetic applications where reference panels may be available to estimate PX or where we have good models of linkage disequilibrium.
To implement model-X knockoffs, we first generate a matrix ‹ X ∈ R n×p of knockoffs such that the following two conditions hold: Roughly, the first says that we cannot distinguish between [X ‹ X] and [X ‹ X] swap(j) , where [X ‹ X] swap(j) is obtained from [X ‹ X] by swapping the jth and (j + p)th columns.The second condition implies that ‹ X does not provide any new information about Y conditional on X and is guaranteed if ‹ X is constructed without looking at Y. If these properties hold, it can be shown that Xj and ‹ Xj are indistinguishable conditional on Y for each j ∈ H0.Next, we define feature importance statistics W = w([X, ‹ X], Y) ∈ R p to be any function of X, ‹ X and Y such that a flip-sign property holds; namely, switching a column Xj with its knockoff ‹ Xj flips the sign of the jth component of the output; formally, wj([X, ‹ X] swap(j) , Y) = −wj([X, ‹ X], Y).Common choices include Wj = |X ⊤ j Y| − | ‹ X ⊤ j Y| (marginal correlation difference statistic) and Wj = | βj(λCV)| − | βj+p(λCV)| (Lasso coefficient difference statistic), where β(λCV) is the solution to the Lasso problem arg min and λCV is usually chosen by cross-validation.Finally, the knockoff filter selects the variables S = {j : Wj ≥ T }, where Here, W = {|Wj| : j = 1, ..., p}\{0}, and T = +∞ if W is empty.Intuitively, the threshold T is chosen to be the most liberal one such that an estimate of FDP is bounded by q.Candès et al. [2018] showed that this procedure controls the FDR of the conditional testing problem at level q.

Gaussian knockoff sampler
Under the assumption that the rows of the data matrix X are i.i.d.from the Gaussian distribution N (0, Σ), we can generate a knockoff vector xi for each row xi of the data matrix X by sampling xi ∼ N (P ⊤ xi, V) independently across rows, where P = I − Σ −1 D, V = 2 D − DΣ −1 D, D = diag{s}, and s ∈ R p is a vector of free parameters usually obtained by solving a convex optimization problem that depends on Σ [Candès et al., 2018].See Appendix A for details of computing s.Concatenating all the knockoff vectors then gives a valid matrix ‹ X ∈ R n×p of knockoffs.In matrix form, the construction above is where E is an n by p matrix with i.i.d.standard Gaussian entries, independent of X and Y.For later reference, we summarize the Gaussian knockoff sampler in Algorithm 1 and denote it as G.

GhostKnockoffs with marginal correlation difference statistic
The original model-X knockoffs procedure relies on having access to the covariates and responses from all data points, i.e., the matrix of covariates X and the response vector Y. Henceforth, we call these individual-level data.In many application scenarios, however, individual-level data are not available due to privacy concerns.Instead, we only have access to some summary statistics of X and Y, e.g., the empirical covariance matrix of the covariaties and the empirical covariance between each covariate and the response.He et al. [2022] proposed GhostKnockoffs, which implements the knockoffs procedure with marginal correlation difference statistic when only X ⊤ Y and ||Y|| 2 2 are available.The key idea of He et al. [2022] is to sample the knockoff Z-score Zs from X ⊤ Y and ||Y|| 2 2 directly, in a way such that where ‹ X = G(X, Σ) is the knockoff matrix generated by the Gaussian knockoff sampler (Algorithm 1).If we use W = Zs − Zs (where Zs = X ⊤ Y) as the feature importance statistic and run the knockoff filter, the resulting rejection set will have the same distribution as that of the knockoffs procedure with marginal correlation difference statistic.Therefore, the two procedures are statistically identical.In particular, they both control the FDR.Specifically, He et al. [2022] showed that for P and V computed in step 3 of Algorithm 1, satisfies (5) as detailed in Appendix B. All this is summarized in Algorithm 2. In the following sections, we refer to Algorithm 2 as GhostKnockoffs with marginal correlation difference statistic (GK-marginal ).

Setting
As we have just seen, GhostKnockoffs-marginal gives a way to test conditional hypotheses while maintaining FDR control when only the summary statistics X ⊤ Y and ∥Y∥ 2 2 are available to the analyst.Now, we consider the setting in which we have knowledge of the empirical covariance matrix X ⊤ X and the sample size n, in addition to X ⊤ Y and ∥Y∥ 2 2 .These quantities only reveal sample averages of relevant quantities, as opposed to all the individual-level information.
In this section, we propose a variable selection method that utilizes only X ⊤ X, X ⊤ Y, ∥Y∥ 2 2 , and n.Our method achieves FDR control and power comparable to the knockoffs procedure with the cross-validated Lasso coefficient difference statistic defined in Section 2. This is interesting because the latter usually outperforms GhostKnockoffs with the marginal correlation difference statistic by a significant margin.Notably, for a fixed tuning parameter λ, we show that our procedure is equivalent to a corresponding knockoffs method using the Lasso coefficient difference statistic with the same penalty level λ.

GhostKnockoffs with the Lasso
Recall that in the knockoffs procedure with the Lasso coefficient difference statistic, we solve the optimization problem where ‹ X = G(X, Σ).We then define the Lasso coefficient difference feature importance statistics by If we have access to individual-level data, λ is usually chosen by cross-validation (Candès et al. [2018] and Weinstein et al. [2020]).* As a first step, we would like to run a statistically equivalent procedure using X ⊤ X, X ⊤ Y, ∥Y∥ 2 2 , and n for a fixed λ.Note that, with λ fixed, (7) depends on the data only through In the case that Y is binary, one may think that utilizing (penalized) logistic regression would give much better power than Lasso.In Appendix P, we show that this intuition may not be correct through simulations, even when Y is generated according to a logistic regression model.
The Gram matrix can of course be equivalently reconstructed from (∥Y∥ . The main idea is to sample from the joint distribution of T (X, ‹ X, Y) using the Gram matrix of [X, Y] only.Based on this, we can then generate the solution to the Lasso problem (7) (in distribution) for a fixed λ. * This is achieved via the following Proposition 1, which says in words that if we generate 'fake' data matrices q X and q Y that lead to the same Gram matrix as that of X and Y, then the distribution of T remains unchanged if we replace the original data matrices by the fake data matrices.
Proof of Proposition 1 is provided in Appendix C. Specifically, Proposition 1 suggests that summary statistics (X ⊤ X, X ⊤ Y, ||Y|| 2 2 , Σ) are sufficient for sampling the Gram matrix T (X, ‹ X, Y).
4: Run the standard knockoffs procedure (at level q) with the Lasso coefficient difference statistic on q X and ‹ q X for a fixed penalty level λ or use the methods from Sections 3.3 and 3.4.
We are now able to write down a procedure, namely, Algorithm 3, which is statistically equivalent to the corresponding individual-level knockoffs procedure using the Lasso coefficient difference statistic (or any statistic defined in Sections 3.3 and 3.4).In step 2, q X and q Y can be obtained by performing the eigen-decomposition or Cholesky decomposition of [X Y] ⊤ [X Y].Brief procedures to construct q X and q Y via eigen-decomposition are provided in Appendix D. All we need to do is to run the knockoffs procedure with q X and ‹ q X in lieu of X and ‹ X.We say that the procedure is equivalent since the rejection sets have the same distribution.In particular, this proves that Algorithm 3 controls the FDR.
) and an independent random variable U. Define " W = f (T ( q X, ‹ q X, Y), U).Let S1 (resp.S2) be the rejection set obtained from applying the knockoffs filter on W (resp. " Thus, if W obeys the flip-sign property, both procedures have equal FDR at most equal to q.
Therefore, the procedures have the same FDR.
* Careful readers may realize that the solution of the Lasso problem does not depend on ∥Y∥ 2 2 .Here we include ∥Y∥ 2 2 as an input of to be able to make a more general statement later that goes beyond the Lasso.
We can easily adapt the method above to accommodate other types of regularization, such as Ridge regression and Elastic Net.

GhostKnockoffs with the square-root Lasso
In Section 3.2, we assumed that the tuning parameter λ in ( 7) is fixed.In practice, one may choose the penalty level using information from the Gram matrix of [X, Y], and the sample size n.Since individual-level data is not available, we are unable to use data-splitting approaches such as cross-validation.
An alternative way to define feature importance is to use the square-root Lasso [Belloni et al., 2011], for which the choice of a reasonable tuning parameter is convenient.The square-root Lasso applied to the knockoffs setting solves β(λ) ∈ arg min and a good choice of λ is given by where ϵ ∼ N (0, In) and κ is a unitless hyperparameter [Tian et al., 2018].This value is a scalar multiple of the expected value of the minimal penalty level required such that all the coefficients are shrunk to zero under the global null model.The square-root Lasso has the benefit that the value of the hyperparameter does not depend on the details of the distribution of Y conditional on X.We also found that the performance of our procedure does not depend very sensitively on the choice of κ.In our data examples, we take κ = 0.3.
In the setting where we only know about values of the summary statistics, we simply replace (X, ‹ X, Y) by ( q X, ‹ q X, q Y) in ( 8).Further, we note that for any orthogonal matrix Q, where the second equality follows from Q ⊤ ϵ d = ϵ.Therefore, the value of the hyperparameter in (9) remains unchanged if we multiply [X ‹ X] by Q on the left.This implies that (9) is a deterministic function of [X ‹ X] ⊤ [X ‹ X].Hence, the feature importance statistic is a function of T (X, ‹ X, Y).Following Corollary 1, we can apply the knockoffs procedure with the square-root Lasso and matrices ( q X, ‹ q X) in lieu of (X, ‹ X).Upon choosing we get a procedure, which is statistically indistinguishable from that we would get if we were performing all the same steps with X and ‹ X.(In practice, we compute the value in (10) via Monte Carlo simulation.)In the sequel, we call the resulting procedure summary statistics GhostKnockoffs with square-root Lasso importance statistic (GK-sqrtlasso).Note that GK-sqrtlasso controls the FDR as the flip-sign property of the feature importance statistic holds.This is because swapping a variable with its knockoff does not change the value of the hyperparameter.Therefore, by Corollary 1, applying the knockoff filter to the square-root Lasso feature importance statistics yields FDR control.

GhostKnockoffs with the Lasso-max
In the standard fixed-X knockoffs setting, cross-validation is also not feasible, since doing so would violate the sufficiency condition required for the feature importance statistics.As one possible alternative, Barber and Candès [2015] considered using as the feature importance statistic the value of λ on the Lasso path at which feature Xj first enters the model.Formally, they define the feature importance statistic Wj = sup{λ : βj(λ) ̸ = 0} − sup{λ : βj+p(λ) ̸ = 0}, where β(λ) is as in (7).We call this statistic the Lasso-max statistic.Intuitively, a larger penalty level is required to shrink an important feature to zero, so we should expect Wj to be large and positive for non-nulls.
By Corollary 1, with the Lasso-max statistic Algorithm 3 produces a rejection set that has the same distribution as the rejection set obtained from the corresponding individual-data-based knockoffs procedure.We call this summarystatistic-based procedure GhostKnockoffs with Lasso-max statistic (GK-lassomax ).
We remark that choices of other tuning parameters and feature importance statistics are also possible.For instance, we may choose λ to minimize the Stein's unbiased risk estimate (SURE) associated with (7).We shall however focus on the two approaches we have described.

Numerical simulations
We consider a variety of simulation settings in which we compare the performance of the proposed GhostKnockoffs with square-root Lasso and Lasso-max statistics (GK-sqrtlasso and GK-lassomax, defined in Sections 3.3 and 3.4), GhostKnockoffs with marginal correlation difference statistic (GK-marginal, defined in Section 2), and the knockoffs procedure with (cross-validated) Lasso coefficient difference statistic with individual-level data (KF-lassocv).Note that the first three are statistically equivalent to the corresponding knockoffs procedures with individual-level data.

Independent features
In the first set of simulations (Figure 1), we generate random samples xi iid ∼ N (0, Ip) and Yi = β ⊤ xi + √ nϵi, where ϵi iid ∼ N (0, 1) for i ∈ {1, 2, ..., n}.* We consider three settings of varying dimensionality measured by the ratio p/n: (n, p) ∈ {(600, 200), (400, 400), (200, 600)}.In each of the three settings, we create a sparse vector β by selecting 30 coordinates to be non-zero uniformly at random.The signs of these non-zero coordinates are assigned to be either positive or negative with equal probability.We vary the signal amplitudes such that we explore a wide power range below.For the square-root Lasso, we average over 200 Monte Carlo samples to calculate The target FDR is 20%.Each point on the curves represents the average of the results from 200 replications.We observe that GK-sqrtlasso and GK-lassomax generally demonstrate greater power than GK-marginal.This enhanced performance is not surprising, as GK-sqrtlasso and GK-lassomax (1) have access to additional information via X ⊤ X, and (2) employing a joint modeling algorithm such as Lasso generally provides a better assessment of variable importance for understanding conditional (in)dependence since such a model explicitly adjusts for the effects from all the other variables.We also note the presence of power gaps between GK-lassocv and GK-sqrtlasso/GKlassomax, likely due to the fact that we are unable to perform cross-validation without individual-level data.All methods control the FDR at the desired level.
Again, we observe that GK-sqrtlasso and GK-lassomax generally have greater power than GK-marginal.All methods have (almost) decreasing power as the autocorrelation coefficient increases, since it becomes harder to separate true signals from null variables that are correlated with them.All methods control the FDR at the desired level.4 GhostKnockoffs with Penalized Regression: Missing Empirical Covariance

Setting
Thus far, we have discussed how incorporating the additional information from X ⊤ X and n could enhance our ability to detect significant features.However, in applications such as genetics, X ⊤ X may not be available.In this section, we propose alternative procedures when the scientist only knows about X ⊤ Y, ∥Y∥ 2 and the sample size n.As before, we assume that X ∼ N (0, Σ), where the covariance matrix Σ is known (or can be estimated from other data sources).

GhostKnockoffs with pseudo-lasso
The idea of our method is to modify the Lasso objective function so that it can be constructed from the available summary statistics.It turns out that the solution of our modified objective function is proportional to that of the scout procedure (with known precision matrix) proposed by Witten and Tibshirani [2009].We will see through simulation studies that our procedure improves the power of the original GhostKnockoffs method of [He et al., 2022] while maintaining FDR control.

The procedure
Recall that in the knockoffs procedure with the Lasso statistic, we solve the following optimization problem: To mimic the form of the loss function when we do not observe the empirical covariance of the features, we may want to substitute them with their population version: i.e. we swap X ⊤ X/n and ‹ X ⊤ ‹ X/n with Σ and X ⊤ ‹ X/n with Σ − D. As usual, D = diag{s} is obtained by solving the convex optimization problem (15).In the language of fixed-X knockoffs [Barber and Candès, 2015], this is equivalent to regarding ‹ X as a fixed-X knockoff of X and replacing X ⊤ X/n by Σ. * This yields Algorithm 4.
Algorithm 4 GhostKnockoffs with Penalized Regression: Missing Empirical Covariance , where D and P are defined as in Section 2.2.2 and λ is fixed or as chosen in Section 4.2.2 4: Run the standard knockoffs procedure (at level q) with importance statistic We call this procedure GhostKnockoffs with pseudo-lasso statistic (GK-pseudolasso).We show below that Algorithm 4 controls the FDR of selections at level q.Before doing so, we first state a general proposition that includes GK-marginal as a special case.
Proposition 2. Suppose V and P are defined as in Algorithm 2, Z ∼ N (0, V) is independent of X and Y, and ‹ * We remark that similar objective functions have been used in, for example, Mak et al. [2017] and Zou et al. [2022].
Proof.In Appendix B, we prove that Therefore, the procedures have the same FDR.
Set λ to be a fixed numerical constant.Consider the feature importance statistics W defined by and ‹ X = G(X, Σ) is the Gaussian knockoff data matrix.The feature importance statistic in Algorithm 4 is thus obtained by replacing ‹ , it follows from Proposition 2 that the rejection set of Algorithm 4 has the same distribution as that obtained from running the knockoff filter on W.
Thus to prove that Algorithm 4 controls the FDR of rejections at level q, it suffices to verify the flip-sign property of the feature importance statistic for W (see Section 2).This is a consequence of the following lemma: Let ΠS be any permutation matrix which swaps the jth and (j+p)th entries of a 2p-dimensional vector for each j ∈ S ⊂ {1, ..., p}.Assume that C is S-swap invariant in the sense that Π ⊤ S CΠS = C. Then β is a solution to (12) if and only if ΠS β is a solution to the same problem with d and ΠSd swapped.In other words, swapping the entries of d has the effect of swapping the corresponding entries of the solution.
Proof.Consider the objective with problem data ΠSd: where the equality follows because Π ⊤ S CΠS = C and because the 1-norm and 2-norm are invariant under permutation.Now, the objective on the right-hand side is the objective with data d.If β is the solution with data d, it follows that ΠS β is the solution with data ΠSd, and vice versa.This proves the lemma.
Corollary 2. Algorithm 4 with a fixed λ controls the FDR of rejections at level q.
Proof.It is easy to show that in Lemma 1 establishes the flip-sign property of W and, therefore, the FDR control of Algorithm 4 for a fixed λ.
In practice, to ensure numerical stability, we add a small positive constant multiple of the identity matrix to when solving for β.This is equivalent to incorporating a small Ridge penalty into the objective function.It is easy to see that the lemma proved above guarantees that this modification does not compromise the FDR control as is also S-swap invariant for any c ∈ R and any S ⊂ {1, ..., p}.

Choice of tuning parameter
Several methods can be used to tune the value of the hyperparameter λ.We here consider two approaches.
Method 1 (lasso-min) Pretend a homogeneous Gaussian linear model holds, i.e.Y = Xβ * + σϵ for some Focus on (11) first and imagine that we have a method for computing λ that depends on data only through ∥Y∥ 2 2 , X ⊤ Y, and ‹ X ⊤ Y.Note that the objective in Algorithm 4 only substitutes ‹ X ⊤ Y in (11) with P ⊤ X ⊤ Y+∥Y∥2Z.Therefore, by Proposition 2 if we set λ via the same functional and work with P ⊤ X ⊤ Y + ∥Y∥2Z in lieu of ‹ X ⊤ Y, we shall achieve FDR control with this data-driven value of the hyperparameter λ.This holds of course with the proviso that our selection of hyperparameter is symmetric in the sense that it produces feature importance statistic obeying the flip-sign property.
To set the tuning parameter λ0 in (11), we use the common choice of taking a constant multiple of the expected value of the minimum λ value such that β(λ) = 02p under the null model Y = σϵ.By the Karush-Kuhn-Tucker (KKT) conditions [Boyd and Vandenberghe, 2004], this results in a tuning parameter of the form where κ is a hyperparameter between 0 and 1.Since î X ‹ X ó is a data matrix whose rows are iid samples from ] is a numerical constant, which can be estimated arbitrarily well via Monte Carlo simulations.We use the approach from Dicker [2014] to give an estimate of σ, which crucially requires knowing only ∥Y∥ 2 2 , X ⊤ Y, and ‹ X ⊤ Y. Dicker [2014] showed that the estimator is consistent and asymptotic normal in the high-dimensional regime.Specifically, in our setting, we estimate σ by In sum, a choice for λ in Algorithm 4 is this: where Z is independent of everything else.

Output
where the approximation sign ≈ reminds us that the expectation is only approximate.
As in the square-root Lasso case, we observe that the power of our method is not very sensitive to the choice of κ.We use κ = 0.6 in our simulations below.In Appendix E, we provide details of computation of λ and prove that Algorithm 4 maintains FDR control with the computed λ.
Method 2 (pseudo-sum) An alternative way of choosing λ is to adapt the pseudo-summary statistics approach proposed by Zhang et al. [2021].Set r = X ⊤ Y/n and r = P ⊤ r + ∥Y∥2Z/n.The main idea of Zhang et al. [2021] is to generate training summary statistics rt and validation summary statistics rv from r and r based on the training and validation sample sizes nt and nv respectively (in this paper we take nt = 0.8n and nv = 0.2n).Following Zhang et al. [2021], we generate the training summary statistics ñ r r where and the validation summary statistics ñ r r Given a sequence of candidate λ values, we choose that which maximizes an approximation f (λ) of the correlation between the predicted values and the true values on the pseudo-validation set.* Specifically, Zhang et al. [2021] considered the approximation where βt,λ = arg min Therefore, we choose the λ value that maximizes (13) among a set of candidate values.Since the objective function ( 11) is convex in β, we may employ the BASIL framework proposed by Qian et al. [2020], which implements a batch version of the strong rules introduced in Tibshirani et al. [2012].BASIL can be directly applied to compute the solution path of ( 14) efficiently.
Note that there exist other ways to choose the penalty level λ using X ⊤ Y, ∥Y∥2 and n (for example, the Lassosum by Mak et al. [2017]).We do not attempt to claim an optimal strategy.
Connection with the scout procedure It turns out that step 3 of Algorithm 4 is closely related to the scout procedure [Witten and Tibshirani, 2009].The scout procedure defines a family of covariance-regularized regression methods that achieve superior prediction via shrinking the inverse covariance matrix.It includes the Lasso, Ridge and Elastic Net as special cases.In Appendix F, we show that the solution of objective function ( 11) is proportional to that of the scout procedure (with known precision matrix Σ −1 ).This connection provides a justification on why the objective function ( 11) is effective.

GhostKnockoffs with other feature importance statistics
In the previous sections, we presented a feature importance statistic based on summary statistics that leads to better power than the marginal correlation difference statistic.By Proposition 2, GhostKnockoffs techniques can be combined with any other feature importance statistics that i) are based on the summary statistics X ⊤ Y, ∥Y∥ 2 and the sample size n and ii) satisfy the flip-sign property.The procedures generated will still guarantee FDR control.In our simulation studies, we found that using the posterior inclusion probability (PIP) produced by the SuSiE-RSS model [Zou et al., 2022] as the feature importance statistic also results in consistent power improvement over GK-marginal.SuSiE-RSS is based on the Sum of Single Effects (SuSiE) model proposed by Wang et al. [2020], which assumes a Bayesian linear model with true coefficients β represented as the sum of multiple one-hot (random) individual effect vectors.Zou et al. [2022] combines SuSiE with a modified likelihood function to accommodate applications in which only summary statistics are available (see Zou et al. [2022] for details).* We call the resulting procedure GhostKnockoffs with SuSiE-RSS statistic and denote it by GK-susie-rss.We include this method in the simulation section below.

Variants of GhostKnockoffs
The methods we presented so far can be adapted to work with various related procedures.We give three examples below for illustration.

Multi-knockoffs
The knockoffs procedure is a randomized procedure which could produce very different selection sets on different runs.This is especially true when the knockoffs rejection set is small.In fact, the offset on the numerator in (3) implies that knockoffs either rejects more than ⌈ 1 q ⌉ hypotheses, where q is the target FDR level, or rejects nothing.To improve the stability of the knockoffs procedure, Gimenez and Zou [2019] proposed simultaneous multi-knockoffs, which is substantially more stable and powerful than knockoffs when the rejection set is small and maintains FDR control in general.
The idea of Gimenez and Zou [2019] is to create M (instead of one) knockoff copies for every feature so that they jointly satisfy an extended exchangeability condition.* If X ∼ N (0, Σ), Gimenez and Zou [2019] showed that Here, D = diag{s}, and s is obtained by solving a more restrictive convex optimization problem than in (15) which guarantees that G is positive semi-definite (see Gimenez and Zou [2019] for details).In data matrix form, we generate valid M multi-knockoffs by . standard normal entries, and Gimenez and Zou [2019] generalized the knockoffs threshold (3) and the flip-sign property to produce FDRcontrolling rejection sets after generating multiple knockoffs via this procedure.
In the summary statistics settings, upon redefining P, V and s as above and replacing the standard knockoffs filter by the multi-knockoffs filter, Algorithms 2 and 3 produce rejection sets that have the same distribution as those produced by their corresponding versions with individual-level data.For Algorithm 4, we simply need to further replace

Group knockoffs
When variables are highly correlated, selection procedures become conservative.For example, if a non-null variable Xj is highly correlated with a null variable X k , it becomes difficult to reject Xj ⊥ ⊥ Y |X−j.This is an important practical concern because highly correlated features are ubiquitous in many settings, particularly GWAS datasets.
To overcome this challenge, group knockoffs [Dai and Barber, 2016] can be useful; please see Chu et al. [2023], whose algorithms we employ in the data analyses of Section 5.In group knockoffs, the object of inference is shifted from single variables to groups of highly correlated variables.Specifically, suppose we partition p features into g groups and reorder all features such that features of the same group are in adjacent columns of X.The objective is to test group conditional independence hypothesis: where γ ∈ {1, ..., g} denotes a group and Xγ is the vector of features in group γ.When these groups have strong correlation, single-variable knockoffs may struggle to identify signals, but group knockoffs retain power to identify significant groups.As in Section 4.3.1,all methods described in this paper apply to group knockoffs after redefining D to the equivalent version in group knockoffs.In Appendix G, we detail the construction of group knockoffs and examples of importance scores at the group level for inference.

Conditional randomization test
The conditional randomization test (CRT) [Candès et al., 2018] is an alternative method to test the conditional independence hypotheses Hj : Xj ⊥ ⊥ Y | X−j for 1 ≤ j ≤ p.By generating a valid 'CRT p-value' pj for each hypothesis Hj, existing multiple testing procedures, including the Benjamini-Hochberg procedure [Benjamini and Hochberg, 1995] and the selective SeqStep+ filter [Li and Candès, 2021], can be used to simultaneously test H1, . . ., Hp with FDR control.* As shown in Candès et al. [2018] and Wang and Janson [2021], doing so can improve the power of multiple testing with greater computational complexity.
In Appendix H, we introduce Ghostknockoffs for CRT (GhostCRT ), which adopts techniques introduced in this paper to the framework of CRT.

Numerical simulations
We conduct simulations on synthetic data as well as semi-synthetic data generated from a real-world genetic dataset.Specifically, we apply GhostKnockoffs with pseudo-lasso statistic (GK-pseudolasso, defined in Algorithm 4 with tuning parameter λ chosen by either lasso-min or pseudo-sum from Section 4.2.2) and GhostKnockoffs with SuSiE-RSS statistic (GK-susie-rss, defined in Section 4.2.3).We compare their performance with GhostKnockoffs with marginal correlation difference statistic (GK-marginal, defined in Section 2) and the knockoffs procedure with (cross-validated) Lasso coefficient difference statistic based on individual-level data (KF-lassocv).We also demonstrate empirically the robustness of our procedures by showing the FDR control when only an estimate of the true covariance matrix Σ is available and when the features are discrete.

Simulations based on real-world genetic data
To mimic the dependency structure among features in real-world applications, we generate synthetic data based on the whole genome sequencing (WGS) data from the Alzheimer's Disease Sequencing Project (ADSP).The data are obtained from the ADSP consortium following the SNP/Indel Variant Calling Pipeline and data management tool (VCPA) [Leung et al., 2019].The ADSP WGS data records counts of minor alleles of genetic variants over 16,906 individuals.Using reference populations from the 1000 Genomes Consortium [The 1000 Genomes Project Consortium, 2015], we estimate ancestry rates of each individual by SNPWeights v2.1 [Chen et al., 2013] and extract 6,952 individuals with estimated European ancestry rate greater than 80%.We further restrict our simulations to 2,000 randomly selected genetic variants within 0.5Mb distance to the APOE gene (chr19:44909011-45912650; hg38), whose ε2 allele and ε4 allele are known to be respectively the strongest genetic protective factor and the strongest genetic risk factor for Alzheimer's disease [Serrano-Pozo et al., 2021, Belloy et al., 2023], and with minor allele frequency (MAF) larger than 0.01.Since our simulations focus on performance at identifying relevant clusters of tightly linked variants, we simplify the simulation design by pruning variants to eliminate pairs with absolute correlation greater than 0.75.To do so, we first compute the correlation matrix [cor(Xj, X k )]2000×2000 of the 2,000 selected variants over the 6,952 extracted individuals using the shrinkage estimate in the R package corpcor [Schäfer and Strimmer, 2005] and apply hierarchical clustering (single-linkage with cutoff value 0.25) on the distance matrix [1 − |cor(Xj, X k )|]2000×2000.As a result, we obtain 512 variant clusters such that pairwise correlation between any pair of variants from different clusters is in [−0.75, 0.75].By randomly choosing one representative variant from each cluster, we include p = 512 tested genetic variants in the simulation study.
For each replicate, we obtain synthetic data by randomly sampling n = 3, 000 individuals without replacement and collecting the sampled individuals' records on the p = 512 tested genetic variants as the n × p covariate matrix X.We further sample another n = 3, 000 individuals without replacement as the reference panel on which we compute the correlation matrix Σ using the shrinkage estimate in the R package corpcor [Schäfer and Strimmer, 2005].Based on the covariate matrix X, we generate the response vector Y = (Y1, . . ., Yn) ⊤ from either the linear model (continuous response), Yi = β1Xi1 + ...
Specifically, β0 under the mixed-effect logit model is − log(9) so that the prevalence (or the expected proportion of Yi = 1) is 10%.ϵ C i 's and ϵ B i 's reflect variation due to unobserved covariates.Only 10 randomly selected coefficients βj are nonzero, with value βj , where mj is the MAF of the j-th variant.
With the relevant summary statistics computed, we apply GK-pseudolasso and GK-susie-rss and compare their performances with GK-marginal and KF-lassocv.
Over 1000 replicates under both the linear model and the mixed-effect logit model, average power and FDR of different methods with respect to different target FDR levels are visualized in Figure 3.Under both models, we observe that GK-pseudolasso with both ways of selecting the tuning parameter and GK-susie-rss are uniformly more powerful than GK-marginal.The performance of the proposed methods is very close to that of KF-lassocv.Despite the covariance matrix being estimated using an independent sample and the entries of X being discrete, the FDRs of our proposed methods are controlled in both settings, suggesting the robustness of our methods.

GhostKnockoffs with discrete features
We note that discrete covariates do not follow a Gaussian distribution.However, the knockoffs procedure ensures FDR control whenever the feature importance statistics Wj = w(Tj, Tp+j), where w is an anti-symmetric function, and T ∈ R 2p is distributionally invariant upon swapping Tj with Tj+p for each null j.Using Lemma 1, we know that Algorithm 4 controls the FDR if swapping the j−th entry of Z = X ⊤ Y and the j−th entry of Z = P ⊤ X ⊤ Y + ∥Y∥2Z does not change their joint distribution for each null j.In Appendix J, we visually demonstrate the approximate preservation of this distributional invariance.This, along with the robustness of knockoffs [Candès et al., 2018, Barber et al., 2020], helps in explaining why we have not observed FDR inflation with discrete covariates.

Independent features
We revisit the setting from Section 3.5.1 in which Σ = Ip.For the pseudo-sum method for GK-pseudolasso, we optimize over λ using a grid of 100 candidate values interpolating between λmax and λmax/1000 linearly in log scale, and is the minimal λ value that shrinks all the coefficients to zero.To calculate E[∥R ⊤ ϵ∥∞] for the lasso-min parameter method, we use a Monte Carlo estimate averaged over 200 samples.The target FDR is 20%.Each point represents an average over 200 replications.
Note that when Σ = Ip, the solution to ( 15) is D = Ip.It is easy to see that (11) gives where the soft-threshold operator S λ (x) = sign(x)(|x| − λ)+ is applied coordinate-wise.Therefore, the method in Section 4.2 soft-thresholds the marginal correlation of X and Y.
As shown in Figure 4, all three new methods (GK-pseudolasso with lasso-min/pseudo-sum and GK-susie-rss) consistently outperform GK-marginal, and the FDR is always controlled at the expected level, as theoretically guaranteed.Simulations with Independent Covariates KF-lassocv GK-marginal GK-pseudolasso (lassomin) GK-susie-rss GK-pseudolasso (pseudosum) As n/p grows, we see that the three new methods have power closer to KF-lassocv.This is further demonstrated in additional simulations in Appendix I.

AR(1) features
Figure 5 shows the corresponding plots when the covariate matrix is generated from an AR(1) distribution.We found similar patterns to those with independent features.The power of all methods drops when the autocorrelation coefficient increases, as it is then harder to separate true signals from other variables.Simulations with AR(1) Covariates KF-lassocv GK-marginal GK-pseudolasso (lassomin) GK-susie-rss GK-pseudolasso (pseudosum)

Application to meta-analysis for Alzheimer's disease
To illustrate the empirical performance of the methods in detecting genetic variants associated with Alzheimer's disease (AD), we apply them to a meta-analysis of nine large-scale array-based genome-wide association and wholeexome/-genome sequencing studies for AD.We include the details of the nine studies in Appendix K.As all studies share the same focus on individuals with European ancestry, we perform a meta-analysis by aggregating their Z-scores and obtain the meta-analysis Z-score Zmeta (see Appendix L for details).In addition, we obtain the block-diagonal covariance matrix Σ with respect to approximately independent linkage disequilibrium blocks provided by Berisa and Pickrell [2016].Within each block, we use the UK Biobank directly genotyped data as the reference panel and compute the covariance matrix via the Pan-UKB consortium (https://pan.ukbb.broadinstitute.org)with details in Appendix M. To improve the power in the presence of tightly linked variants, we apply the group knockoffs construction on top of the GhostKnockoff algorithm, as detailed in Section 4.3.2.Finally, we implement GK-pseudolasso with tuning parameter chosen by the lasso-min method on the meta-analysis Z-score Zmeta and the covariance matrix Σ.To stabilize the GhostKnockoffs procedures, we use M = 5 multi-knockoffs as defined in Section 4.3.1.
Figure 6: Graphical representation of the feature importance statistics after applying the GK-pseudolasso on a meta-analysis of AD.Each point represents a group of genetic variants.With an target FDR level of 0.1, identified groups are highlighted in blue or purple.For each locus with at least one identified group, the name of the locus is presented at the variant group with the largest importance statistic (highlighted in purple).Variant density is shown at the bottom of plot (number of variants per 1Mb).
Figure 6 presents the result of the meta-analysis of the nine studies via our proposed method with target FDR level 0.1.Here, we specify loci based on variant groups and annotate two loci as different loci if they are 1 Mb away from each other.We adopt the most proximal gene's name as the locus name.* As shown by Table 1 in Appendix N, GK-pseudolasso identifies variant groups in 42 and 63 loci when the target FDR level is 0.1 and 0.2 respectively, substantially more than GK-marginal (10 and 17 when the target FDR level is 0.1 and 0.2, respectively).This is consistent with our simulation results in Section 4.4.In addition, we observe from Table 1 that GK-susie-rss identifies fewer loci (35 and 47 when the target FDR level is 0.1 and 0.2, respectively), although it exhibits similar power in simulation studies.In Appendix O, we analogously visualize results of the meta-analysis via conventional marginal association test (with p-value cutoff 5 × 10 −8 ), GK-marginal (with target FDR level 0.10), and GK-susie-rss (with target FDR level 0.10).
Table 2 in Appendix N shows the top variant with the largest feature importance statistic in each identified group.Most discoveries exhibit relatively strong marginal associations (marginal p-value ≤ 0.05) in individual studies and the same direction of effects across all studies.Although some loci have an opposite direction of effect in one individual study, such effects are not significant.The consistency across individual studies supports the validity of the proposed method in discovering putative causal variants.In addition, we observe that all top variants of identified groups have small meta-analysis p-values (less than 0.05), though some are not smaller than the stringent genome-wide threshold (5 × 10 −8 ) in marginal association tests with FWER control.
To further investigate whether the identified groups are functionally enriched, we apply a SNP-to-gene linking strategy proposed by [Gazal et al., 2022] to link the top variants of identified groups to the genes that they potentially regulate.Out of 63 top variants, we find that 34 (54.0%) can be mapped with functional evidence (e.g., being an expression quantitative trait locus, in a Hi-C linked enhancer region, near the exon of a gene, etc.), where the proportion is significantly higher than the average percentage of the background genome (28.6%).In summary, the proposed method can identify functional genetic variants with weaker statistical effects missed by conventional association tests.
* Specifically, we consider the variant group with the largest group knockoff feature importance statistic within a locus, and then map the locus to the most proximal gene of the variant within the group that has the highest knockoff importance score.

Discussion
This paper introduced novel approaches for performing variable selection with FDR control on the basis of summary statistics.We proposed methods for testing conditional independence hypotheses from summary statistics alone.For the methods from Section 4, all we need are essentially the marginal correlations between X and Y , * which, at first sight, may appear surprising.Our arguments rely on the assumption that the covariates follow a Gaussian distribution, as well as on the linearity and rotational invariance of Gaussian distributions.Since our methods are based on the knockoffs procedure, they do not require any knowledge about the model of Y given X.Our methods extend, and generally give better power than, the work by He et al. [2022] by employing penalized regression to produce the measure of feature importance.The techniques employed in this paper provide a wrapper that can be combined with a variety of feature selection methods, yielding knockoffs versions that guarantee FDR control.
We applied our methods to genetic studies, in which summary statistics are typically available.Due to linkage disequilibrium, the application of our methods to individual genetic variants may yield conservative results.In a parallel work Chu et al. [2023], we have developed tools for constructing group knockoffs efficiently and effectively.When combined, our methods offer a powerful new approach to controlled variable selection in GWAS.This is further supported in our companion work He et al. [2023], where we see the methods in this paper led to significant scientific discoveries.* Along with ∥Y∥ 2 and n.

A Computation of free parameters s
In this paper, we use the semidefinite program (SDP) construction of second-order knockoffs Candès et al. [2018].Without loss of generality, we assume that columns of the data matrix X have been standardized with mean 0 and variance 1 such that diagonal entries Σ are 1.As a result, s is the solution of the convex optimization problem.
Other methods to compute s include the minimum variance-based reconstructability (MVR) construction [Spector and Janson, 2022] and maximum entropy (ME) construction [Gimenez andZou, 2019, Spector andJanson, 2022], which are all compatible with our methods in this paper.
B Equivalence of GhostKnockoffs and the Gaussian knockoff sampler in sampling the knockoff Z-score Z s In this section, we summarize the proof of He et al. [2022] that Zs computed by ( 6) satisfies ( 5) as follows.
Lemma 2. [He et al., 2022] For any P and V computed in step 3 of Algorithm 1, we have where Zs is computed by (6) and ‹ X is the output of Algorithm 1.
Proof.By step 5 of Algorithm 1, we have ‹ X = XP + EV 1/2 , where E is an n by p matrix with i.i.d.standard Gaussian entries, independent of X.Therefore,

C Proof of Proposition 1
To prove Proposition 1, we need to first prove Lemma 3.
Lemma 3. Let Z1 and Z2 be two real n by p matrices.For any n and p, if , where U ∈ R p×r is an orthogonal matrix such that U ⊤ U = Ir, Λ ∈ R r×r is diagonal with positive entries and r is the rank of Z ⊤ 1 Z1.In other words, we perform eigen-decomposition of and remove all zero eigenvalues and their corresponding eigenvectors.Note that UU ⊤ is a projection matrix that projects any vector onto colspace(U), the column space of U.
Because U = (UΛU ⊤ )UΛ −1 , we also have As a result, we have colspace(UΛU ⊤ ) = colspace(U).Thus, for k = 1, 2, UU ⊤ is a projection matrix that projects any vector onto the column space of Thus, we have are both orthogonal matrices.Thus, we have and thus Thus, these exists an orthogonal matrix Q = V1V ⊤ 2 such that Z1 = QZ2.
We can then prove Proposition 1 as follows.By Lemma 3, since [ q Let E ∈ R n×p be a matrix with i.i.d.standard Gaussian entries, we have QE is also a matrix with i.i.d.standard Gaussian entries (i.e.
. Therefore, we focus on the third, fifth and sixth arguments of T where In this section, we give details on how to construct [ q We consider two cases as follows.
Case 1 where U1 = [u1 . . .un], and Dn = diag(d1, . . ., dn).Under this case, we let [ q Case 2 (n ≥ p + 1): Under this case, we let E Computation of the tuning parameter λ for the lasso-min method Suppose we had access to individual level data such that Gaussian knockoffs ‹ X can be constructed, we can follow the method of Dicker [2014] to estimate the noise level σ by We could then compute λ = κ where R ∈ R n×2p is a data matrix whose rows are i.i.d.samples from N (0, G), and ϵ ∼ N (0, In) is independent of R. In the summary statistics setting, we replace ‹ X ⊤ Y in ( 16) by P ⊤ X ⊤ Y + ∥Y∥2Z, where P and Z are obtained in Algorithm 4. The expectation E[∥R ⊤ ϵ∥∞] can be computed using Monte Carlo integration.However, when both n and p are very large, sampling R and ϵ becomes too time-consuming.Observing that where R ⊤ ϵ ∥ϵ∥ 2 ∼ N (0, G) and ϵ are independent, we have Therefore, we may approximate where L is the Cholesky decomposition of G and Z ∼ N (0, I2p).
In practice, the simulated ∥LZ∥∞ usually concentrates around its mean as shown in Figure 7. Thus, only several Monte Carlo samples are needed to accurately estimate E[∥LZ∥∞], and we draw 10 Monte Carlo samples throughout numerical experiments of this paper.
Next, we prove that Algorithm 4 maintains FDR control when λ is computed as described in Section 4.2.2.By Proposition 2, it suffices to show that (11) with the computed λ produces feature importance statistics that satisfy the flip sign property.By λ ≈ κ it suffices to show that σ0 is invariant to swapping variables with their knockoffs [Candès et al., 2018].
Let Πj ∈ R 2p×2p be the permutation matrix that swaps the j-th column with the (j + p)-th column of a matrix.Thus, we have suggesting σ0 is invariant to swapping variables with their knockoffs [Candès et al., 2018].Therefore, the FDR of Algorithm 4 is controlled when λ is computed as described in Section 4.2.2.Since all variables in ‹ X are null, in practice we may replace î X ‹ X ó by X, G by Σ and 2p + n + 1 by p + n + 1 in ( 16) to reduce the dimension when estimating σ.Although this would, in theory, break the flip-sign property required for FDR control, no FDR inflation is observed in our simulations.

F Connection with the scout procedure
In this section, we explain the connection of the feature importance statsitic defined in Algorithm 4 and the scout procedure [Witten and Tibshirani, 2009].
For covariates X ∈ R p and response Y ∈ R, Witten and Tibshirani [2009] assume that The population linear regression coefficient of Y on X, which induces a linear predictor that achieves the minimal mean squared prediction error, is given by Let S be the empirical covariance matrix of X and Y , they consider the following covariance-regularized regression approach to estimate β, Here, J1 and J2 are two penalty functions.The first two steps are to appropriately separate true conditional correlations from those purely due to noise.As shown in Witten and Tibshirani [2009], when J2(Θ) = λ2∥Θ∥1 (resp.λ2∥Θ∥ 2 2 ), the solution to step 3 is proportional to the solution of where GXX is the inverse of the solution ΘXX from step 1.In other words, the Lasso corresponds to the setting that J1 = 0 and GXX = SXX .Witten and Tibshirani [2009] consider various settings in which they demonstrate the superiority of the scout procedure over the Lasso, Ridge and Elastic Net.In the setting of Section 4, we have cov Therefore, the objective function ( 11) corresponds to the case that the true ΘXX is used in step 1 (here we include both X and ‹ X as explanatory variables).

G Construction of group knockoffs and examples of importance scores at the group level
For group knockoffs, we test the group conditional independence hypothesis: Compute γ j = Σ −1 −j,−j Σ −j,j ∈ R p−1 and v j = Σ j,j − Σ j,−j Σ −1 −j,−j Σ −j,j .

6:
end for 7: Compute the CRT p-value p j via (17) with T (X j , X −j , Y) = |X ⊤ j Y| and T ( ‹ X b j , X −j , Y) = Z b j .8: end for 9: Output: Selection set by conducting existing multiple testing procedures on CRT p-values p 1 , . . ., p p .
Algorithm 6 GhostKnockoffs with Penalized Regression for CRT: Known Empirical Covariance ] by eigen-decomposition or Cholesky decomposition.

7:
end for 8: Compute the CRT p-value p j via (17) and replacing ‹ X b j by ‹ q X b j with feature importance statistic defined by T (X j , X −j , Y) = | βj |, where ( βj , β−j ) = arg min 9: end for 10: Output: Selection set by conducting existing multiple testing procedures on CRT p-values p 1 , . . ., p p .
As ( 18) is a special case of (4) where • P is obtained by substituting the (j, j)-entry and other entries in the j-th column of Ip by 0 and γj respectively; • V is a matrix of zeros expect the (j, j)-entry equals vj, all theoretical results in Sections 2-4 remain true for the GhostCRT.
Remark 1.In Algorithm 6, the tuning parameter λ is allowed to depend on X ⊤ X, X ⊤ Y, Y ⊤ Y and n.We may also use the square-root Lasso or the Lasso-max importance statistic as outlined in Sections 3.3 and 3.4.

I Additional results for Section 4.4.2
To further demonstrate the effect of sample size on the new GhostKnockoffs methods in comparison to the individual level knockoffs with (cross-validated) Lasso coefficient difference, we consider additional experiments with p = 600 and n = 600/1800/3000 under the same setting of Section 4.4.2.Note that the noise level scales in the order of √ n such that the signal to noise ratio does not change dramatically.From Figure 8, we observe that as n increases, all three new methods proposed in this paper have comparable power with KF-lassocv and outperform GK-marginal [He et al., 2022] consistently, with FDR controlled in all cases.Simulations with Independent Covariates KF-lassocv GK-marginal GK-pseudolasso GK-susie-rss GK-pseudolasso (pseudosum) J Supplementary plots for Section 4.4.1 Let Z = X ⊤ Y and Z = P ⊤ X ⊤ Y + ∥Y∥2Z be defined as in Algorithm 4. Note that for Algorithm 4 to control the FDR, it suffices to require that swapping the j−th entry of Z with the j−th entry of Z does not change the joint distribution of (Z, Z) for each null j.
By the Central Limit Theorem, short sets of entries (e.g.single entries, pairs, and triples etc.) of (Z, Z) are approximately Gaussian.Additionally, in Figure 11, we show empirically that the covariance of (Z, Z) (approximately) satisfies the required swap-invariance for null positions.These approximations, coupled with the robustness of the knockoff framework, empirically yield the FDR control.This is similar to the robustness of second-order knockoffs observed empirically in Candès et al. [2018].
In the setting from Section 4.4.1, Figure 9 depicts the ordered empirical values of Zj (respectively Zj) plotted against an equal-size ordered random sample from a Gaussian distribution with matching mean and variance as the empirical mean and variance of Zj (respectively Zj), for three randomly selected indices.This comparison is based on the 1000 sub-sampled data replications from Section 4.4.1.In Figure 10, we overlay the plots for all indices.We observe that Zj and Zj approximately follow Gaussian distributions.In Figure 11, we present the scatter plots of relevant empirical covariances.We observe that all the points roughly concentrate around the y = x line.This shows the approximate swap-invariance of Z and Z (for null indices).

K Details of the nine studies in Section 5
Section 5 considers the following nine studies for Alzheimer's disease: 1.The genome-wide association study performed by Huang et al. 2. The genome-wide meta-analysis of clinically diagnosed AD and AD-by-proxy by Jansen et al. [2019].
3. The genome-wide meta-analysis of clinically diagnosed AD by Kunkle et al. [2019].
5. In-house genome-wide associations study of 15,209 cases and 14,452 controls aggregating 27 cohorts across 39 SNP array data sets, imputed using the TOPMed reference panels [Belloy et al., 2022a].
6.A whole-exome sequencing analyse of data from ADSP by Bis et al. [2020].
7. A whole-exome sequencing analyse of data from ADSP by Le Guen et al. [2021].

L Calculation of meta-analysis Z-score
Based on Z-scores Z1, Z2, ..., ZK from K studies, we adopt the definition in He et al. [2022] that meta-analysis Z-score with overlapping samples is Specifically, • optimal weights w1, . . .• H = diag{h1, ..., hp} is a diagonal matrix where hj = ( k l w k w l c kj c lj cor.S kl ) −1/2 (j = 1, . . ., p); • cor.S kl is the study correlation between the k-th study and the l-th study.
In practice, when calculating cor.S kl , we only use variants whose Z-scores are bounded in [−1.96, 1.96] in both the k-th study and the l-th study to eliminate the impact of polygenic effects.This meta-analysis approach is a generalization of the METAL method proposed by Willer et al. [2010].

M Obtaining the covariance matrix Σ in meta-analysis for AD
To perform meta-analysis for AD, we need a suitable estimate of the covariance matrix Σ.In this paper, we adopt strategies in He et al. [2023] and Chu et al. [2023] as follows.
We first download the covariance matrix from the Pan-UKB consortium (https://pan.ukbb.broadinstitute. org), which contains about 24 million variants across the human genome derived from about 500, 000 British samples.We then extract p = 650, 576 variants which satisfy the following three conditions: (a) the variant is recorded in the UK Biobank genotype array, (b) its MAF exceeds 0.01, (c) its reference/alternate allele pair matches with the ones listed in all the nine studies in meta-analysis.Based on the covariance matrix of extracted variants, we further partition extracted variants into 1703 quasi-independent blocks using the partition given by Berisa and Pickrell [2016].Finally, we compute the block-diagonal covariance matrix where Σ l is the shrinkage estimator of the covariance matrix of variants in the l-th block using the R package corpcor [Schäfer and Strimmer, 2005].To ensure that all blocks Σ1, ..., Σ1703 are positive definite, we perform eigen-decomposition and increase all their eigenvalues not larger than 10 −5 to 10 −5 .

N Supplementary tables of meta-analysis for AD
Tables 1 and 2 provide more details of the meta-analysis for AD in Section 5. Specifically, Table 1 presents the number of loci, average signals per locus, standard deviation of the number of signals per locus, average groups per locus, and standard deviation of the number of groups per locus identified by conventional marginal association test, GKmarginal, GK-pseudolasso, and GK-susie-rss.Here, the p-value threshold of the conventional marginal association test is 5 × 10 −8 , and GK-pseudolasso uses the tuning parameter chosen by the lasso-min method in Section 4.2.2.For GK-marginal, GK-pseudolasso, and GK-susie-rss, we display results with respect to target FDR levels 0.05, 0.1, and 0.2.Table 2 provides details of top variants of identified loci given by GK-pseudolasso (target FDR level: 0.20), including their positions (columns "Chr." and "SNP"), their reference alleles (column "Ref.") and alternative alleles (column "Alt."),genes that they potentially regulate (column "TopS2GGene"), their closest genes, their Z-scores from different individual studies, their meta-analysis Z-scores, their feature importance scores (W ), and the marginal p-values obtained from meta-analysis Z-scores.
O Supplementary figures of meta-analysis for AD Analogous to Figure 6 in Section 5, Figures 12, 13 and 14 respectively present Manhattan plots of the meta-analysis of the nine studies via conventional marginal association test (with p-value cutoff 5×10 −8 ), GK-marginal (with target FDR level 0.10), and GK-susie-rss (with target FDR level 0.10).
The conventional marginal association test selects many feature groups because it focuses on marginal correlations between feature groups and the response while ignoring spurious correlation induced by linkage disequilibrium.This is shown in Figure 12, where the conventional marginal association test tends to select many nearby loci.This issue is alleviated by the GhostKnockoffs approach that tests conditional independence as seen in Figures 6, 13, and 14.

P Running Lasso on binary responses
In genetic datasets, the response Y is often binary.Performing Lasso or Lasso-type regressions on binary response may sound unreasonable since it violates the usual linear model assumption.One might assume that utilizing penalized logistic regression to generate feature importance statistics would be much more effective.However, a bit surprisingly, we demonstrate that this intuition may not be correct through the following two simulations.
For the first column of Figure 15, we generate Xi iid ∼ N (0, 1 √ n Ip), and, conditional on Xi, P(Yi = 1) = 1 1+e −β ⊤ X i and P(Yi = 0) = 1 − P(Yi = 1), where n = 1000 and p = 300.We create β by uniformly randomly selecting 30 coordinates to be non-zero.The signs of these non-zero coordinates are assigned to be either positive or negative with equal probability.The dark curve represents the knockoffs procedure with Lasso coefficient difference statistic (with tuning parameter chosen by cross-validation), i.e., KF-lassocv.The red curve represents the knockoffs procedure with coefficient difference statistic generated by L1-penalized logistic regression.We vary the signal amplitudes such that we observe relatively complete power profiles below.The target FDR is 0.1.Each point on the curves represents an average over 200 replications.For the second column of Figure 15, we show the result for AR(1) features.Here, n = 600, p = 200 and the signal amplitude (i.e., the magnitude of non-zero β values) is fixed to be 0.5.Otherwise, the simulation setting is exactly the same as the independent case.We observe that the two methods considered have almost the same power and FDR, so the use of penalized logistic regression does not meaningfully affect the results.
2 , and Σ. 2: Compute s, P, and V as in Algorithm 1. 3: Compute the feature importance statistics W = |Z s | − | Z s |, where Z s is generated according to (6).4: Input W into the knockoffs selection procedure.5: Output: Knockoffs selection set.

Figure 1 :Figure 2 :
Figure 1: Power and FDR plots for independent features and a Gaussian linear model with varying dimensions.Each point is an average over 200 replications.
S2) be the rejection set obtained from applying the knockoffs filter on W (resp. " W).Then S1 | X, Y d = S2 | X, Y. Thus, if W obeys the flip-sign property, both procedures have equal FDR at most equal to q.

Figure 3 :
Figure 3: Average power and FDR over 1000 replications with respect to different target FDR levels in simulations based on genetic data, where features are genotypes of existing patients, and the response is simulated from a linear model (continuous response) or a mixed-effect logit model (binary response).

Figure 4 :
Figure 4: Power and FDR plots for independent features and a Gaussian linear model with varying dimensions.Each point is an average over 200 replications.

Figure 5 :
Figure 5: Power and FDR plots for AR(1) features and a Gaussian linear model with varying dimensions.Each point is an average over 200 replications.
Z.C. would like to thank Kevin Guo and Amber Hu for helpful discussions.Z.C. was supported by the Simons Foundation under award 814641.Z.H. was supported by NIH/NIA award AG066206 and AG066515.T.M. was supported by a B.C. and E.J. Eaves Stanford Graduate Fellowship.C.S. was supported by the grants NIH R56HG010812 and NSF DMS2210392.E.J.C. was supported by the Office of Naval Research grant N00014-20-1-2157.

Figure 8 :
Figure 8: Power and FDR plots for independent features and a Gaussian linear model with varying sample sizes and a fixed feature dimension.Each point shown is an average over 200 replications.

Figure 9 :
Figure 9: QQ plots of Z j 's (left) and Zj 's (right) against Gaussian samples with matching mean and variance for three randomly sampled indices.

Figure 10 :
Figure 10: QQ plots of Z j 's (left) and Zj 's (right) against Gaussian samples with matching mean and variance for all indices overlaid.

Figure 11 :
Figure 11: Scatter plots for relevant empirical covariances.Each correlation is estimated from 1000 samples drawn as described in Section 4.4.1.

Figure 12 :
Figure 12: Graphical illustration of the result by applying conventional marginal association test on metaanalysis for AD.The dotted line represents the conventional genome-wide p-value threshold of 5 × 10 −8 .P-values are truncated at 10 −50 for better visualization.The results are obtained from the meta-analysis p-values calculated based on Section L. Variant density is shown at the bottom of plot (number of variants per 1Mb).

Figure 13 :
Figure13: Graphical illustration of the result by applying the GK-marginal on meta-analysis for AD.Each point represents a group of genetic variants.With respect to the target FDR level 0.1, points of identified groups are highlighted in blue or purple.For each locus with at least one identified group, the name of the locus is presented at the variant group with the largest importance statistic (highlighted in purple).Variant density is shown at the bottom of plot (number of variants per 1Mb).

Figure 14 :
Figure14: Graphical illustration of the result by applying the GK-susie-rss on meta-analysis for AD.Each point represents a group of genetic variants.With respect to the target FDR level 0.1, points of identified groups are highlighted in blue or purple.For each locus with at least one identified group, the name of the locus is presented at the variant group with the largest importance statistic (highlighted in purple).Variant density is shown at the bottom of plot (number of variants per 1Mb).

Figure 15 :
Figure 15: Power and FDR plots when the response is generated by a logistic regression model.Each point is an average over 200 replications.

Table 1 :
Summary of results by applying different methods on meta-analysis for AD

Table 2 :
Details of top variants of identified loci given by GK-pseudolasso (target FDR level: 0.20).